Estimating the Determinants of Health Literacy for Policy Prioritisation

Nathan Green

Department of Statistical Science, UCL

Outline

  • Background
  • Problems
  • Solutions
    • Multilevel regression and post-stratification (MRP)
    • Predictive comparisons
    • Prioritisation with SUCRA
  • Main results
  • Sensitivity analysis
  • Conclusion

Resources

Slides and code here: github.com/n8thangreen/data-science-in-health-talk

Background

  • Health literacy is broadly defined as the ability to access, understand, appraise, and communicate health information, enabling individuals to engage in healthcare and maintain good health throughout their lives.

  • UCL Public Policy Fellowship

  • Focusses on Newham, a diverse borough in East London that faces unique challenges
  • Identified as having some of the lowest levels of health literacy in the UK by University of Southampton (https://healthliteracy.geodata.uk/)

Previous method: Synthetic estimation

  • Weighted Logistic Regression with Synthetic Estimation (Laursen et al. (2016))
    • Frequentist single-level regression with poststratification
  • Used in geography (Gonzalez (1973); Rao and Molina (2015))
  • Can be viewed as the simpler predecessor to MRP
    • Ignores any unique local factors
    • MRP includes shrinkage via random effects
  • A linear model can be thought of as equivalent to the unit-level regression-synthetic estimator
    • Like Simulated Treatment Comparison in HTA
  • Residual-adjusted synthetic estimation is similar to Targeted Maximum Likelihood Estimation (TMLE) in causal inference

Problem

  • What are the ‘drivers’ of health literacy, specific to Newham?
  • Can we quantify them?
  • What would happen to health literacy if we were to intervene to affect one of these?

Data

  • Newham Residents Survey 2023 (NRS)

  • Skills for Life Survey 2011

  • Additional data

    • Labour Force Survey (LFS)
    • UK Programme for the International Assessment of Adult Competencies (PIAAC) 2023
    • Skills for Life Survey 2003

Multilevel regression and post-stratification

The predicted probability \(\hat{\pi}_i\) is defined as: \[ \hat{\pi}_i = \text{logit}^{-1} \left( \hat{\beta}_0 + \sum_{x} \hat{\beta}^{x}_{\gamma_x[i]} \right) \]

where \(\hat{\beta}_0\) is the intercept, \(\hat{\beta}^{x}_{\gamma_x[i]}\) are coefficients for covariates \(x\) (age, sex, eng, white, ukborn, qual, inc, job, work, home), and \(\gamma_x[i]\) represents the level or category of covariate \(x\) for individual \(i\). IMD is included as a multilevel random effect, \(\beta^{\text{IMD}}_j \sim \text{N}(\mu_{\text{IMD}}, \sigma_{\text{IMD}}^2)\). Prior distributions for the fixed effects are normal distributions centred at zero with modest variance, and half-normal priors are used for the random effect standard deviations.
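As a minimal sketch of how the predicted probability is assembled, the snippet below sums an intercept, one coefficient per covariate level, and an IMD effect, then applies the inverse logit. All coefficient values, level labels, and the helper `predict` are hypothetical and chosen only for illustration; the sketch also assumes the IMD random effect enters the linear predictor additively.

```python
import math


def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical coefficients, NOT fitted values from the model.
beta_0 = -0.5
beta = {
    "age": {"16-44": 0.3, "45+": -0.3},
    "eng": {"yes": 0.6, "no": -0.6},
}
# Hypothetical IMD group effects (assumed additive on the logit scale).
beta_imd = {1: -0.2, 2: 0.0, 3: 0.2}


def predict(levels: dict, imd: int) -> float:
    """Predicted probability for an individual with the given covariate levels."""
    eta = beta_0 + sum(beta[x][lvl] for x, lvl in levels.items()) + beta_imd[imd]
    return inv_logit(eta)


pi_i = predict({"age": "16-44", "eng": "yes"}, imd=3)
```

Only two covariates are shown to keep the example short; the same lookup-and-sum pattern would extend to the full covariate list.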

The health literacy probabilities for each demographic category (cell \(c\)) are weighted by their proportion in the actual Newham population. With 11 covariates resulting in \(|\mathcal{S}|\) = 13,824 cells, the post-stratified estimate \(\hat{\pi}^{\text{mrp}}\) is: \[ \hat{\pi}^{\text{mrp}} = \sum_{c = 1}^{|\mathcal{S}|} w_c \hat{\pi}_{c} \] where \(\mathcal{S}\) is the set of all covariate combinations, \(N_c\) is the population frequency for cell \(c\), \(N\) is the total population size, and \(w_c = N_{c} / N\) are the combination weights.
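The post-stratification step is a population-weighted average of the cell-level predictions. The sketch below illustrates it with three made-up cells; the function name `poststratify` and all numbers are assumptions for illustration, not values from the analysis.

```python
def poststratify(pi_cell, n_cell):
    """Weighted average of cell probabilities with weights w_c = N_c / N."""
    total = float(sum(n_cell))
    weights = [n / total for n in n_cell]
    return sum(w * p for w, p in zip(weights, pi_cell))


# Hypothetical cell-level predictions and population counts.
pi_cell = [0.2, 0.5, 0.8]
n_cell = [100, 300, 600]

pi_mrp = poststratify(pi_cell, n_cell)
```

In a Bayesian fit, the same function could be applied to each posterior draw of the cell probabilities to obtain a posterior distribution for the post-stratified estimate.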

Predictive comparisons 🤔

  • Terminology borrowed from Gelman and Pardoe (2007); also called the predicted change in probability
  • Appears earlier in other fields, e.g. Lee (1981) (covariance-adjusted mean difference)
  • Like average treatment effects without the causal interpretation

\[ \delta_u(u^{(1)}, u^{(2)}) = \frac{E(y \mid u^{(2)}) - E(y \mid u^{(1)})}{u^{(2)} - u^{(1)}} \]
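A rough sketch of the comparison above: the expected outcome is evaluated at two values of the input of interest and the difference is divided by the change in the input. Following the spirit of Gelman and Pardoe (2007), the sketch averages over a few values of a second input \(v\); the logistic model in `expected_y`, its coefficients, and the values of \(v\) are all hypothetical.

```python
import math


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def expected_y(u: float, v: float) -> float:
    """Hypothetical model for E(y | u, v); coefficients are illustrative only."""
    return inv_logit(-1.0 + 0.8 * u + 0.5 * v)


def predictive_comparison(u1: float, u2: float, v_values) -> float:
    """Average change in E(y) per unit change in u, averaged over v_values."""
    diffs = [expected_y(u2, v) - expected_y(u1, v) for v in v_values]
    return (sum(diffs) / len(diffs)) / (u2 - u1)


delta = predictive_comparison(0.0, 1.0, [-1.0, 0.0, 1.0])
```

Unlike a regression coefficient on the logit scale, `delta` is on the probability scale, which is what makes it comparable across inputs.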

Raking / Iterative proportional fitting

The Goal: Adjust survey weights (\(w\)) so that the sample distribution matches known population control totals (margins).

1. Setup

Let \(w_{ij}^{(t)}\) be the weight for cell \((i, j)\) at iteration \(t\).

We have Target Margins:

  • \(R_i\): Target total for row \(i\)
  • \(C_j\): Target total for col \(j\)

2. The Mismatch

Initially (\(t=0\)), the sample sums do not match the population targets:

\[ \sum_{j} w_{ij}^{(0)} \neq R_i \]

\[ \sum_{i} w_{ij}^{(0)} \neq C_j \]

The Iterative Algorithm

The algorithm alternates between adjusting rows and columns until convergence.

Step 1: Row Raking (Match Row Targets) \[ w_{ij}^{(t+1/2)} = w_{ij}^{(t)} \times \frac{R_i}{\sum_{k} w_{ik}^{(t)}} \]

Step 2: Column Raking (Match Column Targets) \[ w_{ij}^{(t+1)} = w_{ij}^{(t+1/2)} \times \frac{C_j}{\sum_{k} w_{kj}^{(t+1/2)}} \]

Convergence: Repeat until \(\left| \sum w - \text{Target} \right| < \epsilon\).
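The two raking steps can be sketched for a two-way table as follows. This is an illustrative implementation, not the code used in the analysis; the function name `rake`, the tolerance, and the example table are assumptions. It requires strictly positive starting weights and row and column targets that sum to the same total.

```python
def rake(weights, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Iterative proportional fitting for a two-way table of positive weights."""
    w = [row[:] for row in weights]  # copy so the input is not modified
    for _ in range(max_iter):
        # Step 1: scale each row so that it sums to its row target.
        for i, row in enumerate(w):
            s = sum(row)
            w[i] = [x * row_targets[i] / s for x in row]
        # Step 2: scale each column so that it sums to its column target.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in w)
            for row in w:
                row[j] *= target / s
        # Columns now match exactly, so convergence is checked on the rows.
        err = max(abs(sum(row) - r) for row, r in zip(w, row_targets))
        if err < tol:
            break
    return w


# Hypothetical starting weights and target margins (both margins sum to 100).
start = [[1.0, 2.0], [3.0, 4.0]]
raked = rake(start, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
```

Because each step only multiplies whole rows or whole columns by a constant, the association structure of the starting table (its odds ratios) is preserved while the margins are moved to the targets.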

Priority ranking

To summarize these probabilistic rankings, we adopt the surface under the cumulative ranking curve (SUCRA) metric, common in multiple-treatment meta-analysis. SUCRA represents the percentage of the maximum possible cumulative rank an intervention (in our case, an input variable) can achieve, providing a single value where a higher SUCRA indicates a better overall rank relative to others. For our model, it is given by \[ \text{SUCRA}_{ij} = \sum_{r=1}^{n-1} P_{ijr} / (n-1), \] where \(n\) is the number of items being ranked and \(P_{ijr}\) is the cumulative probability that variable \(i\) at level \(j\) is ranked \(r\) or better. The mean rank is \[ \mathbb{E}[\text{rank}(i,j)] = n - \sum_{r=1}^{n-1} P_{ijr}. \]
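A small sketch of how SUCRA and the mean rank could be computed from posterior draws. It assumes that each draw gives one effect value per item, that a larger value means a better rank (rank 1 is best), and that there are no ties within a draw; the variable-by-level pairs \((i, j)\) are flattened into a single item index. The function name `sucra` and the draws are made up for illustration.

```python
def sucra(draws):
    """Return a (SUCRA, mean rank) pair per item from a list of posterior draws."""
    n = len(draws[0])       # number of items being ranked
    n_draws = len(draws)
    # rank_counts[k][r] = number of draws in which item k has rank r + 1
    rank_counts = [[0] * n for _ in range(n)]
    for draw in draws:
        order = sorted(range(n), key=lambda k: draw[k], reverse=True)
        for r, k in enumerate(order):
            rank_counts[k][r] += 1
    results = []
    for k in range(n):
        cumulative = 0.0    # P(rank <= r) for the current r
        total = 0.0         # sum of cumulative probabilities for r = 1..n-1
        for r in range(n - 1):
            cumulative += rank_counts[k][r] / n_draws
            total += cumulative
        results.append((total / (n - 1), n - total))
    return results


# Hypothetical posterior draws: 4 draws for 3 items.
draws = [
    [0.9, 0.5, 0.1],
    [0.8, 0.6, 0.2],
    [0.4, 0.7, 0.1],
    [0.9, 0.3, 0.2],
]
scores = sucra(draws)
```

Here the first item is ranked first in three of the four draws and second in the other, giving a SUCRA of 0.875 and a mean rank of 1.25; the third item is always last, so its SUCRA is 0 and its mean rank is 3.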

Main results 👍

Sensitivity analyses 👍

Other data sets

Reference

Green, N., Kurt, M., Moshyk, A., Larkin, J. and Baio, G. (2025), A Bayesian Hierarchical Mixture Cure Modelling Framework to Utilize Multiple Survival Datasets for Long-Term Survivorship Estimates: A Case Study From Previously Untreated Metastatic Melanoma. Statistics in Medicine, 44: e70132. https://doi.org/10.1002/sim.70132

Conclusions

Thanks 🙏

References

Gelman, Andrew, and Iain Pardoe. 2007. “Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components.” Sociological Methodology 37 (1): 23–51. https://doi.org/10.1111/j.1467-9531.2007.00181.x.
Gonzalez, Maria E. 1973. “Use and Evaluation of Synthetic Estimates.” In Proceedings of the Social Statistics Section, American Statistical Association, 33–42. American Statistical Association.
Laursen, Kamilla R., Paul T. Seed, Joanne Protheroe, Michael S. Wolf, and Gill P. Rowlands. 2016. “Developing a Method to Derive Indicative Health Literacy from Routine Socio-Demographic Data.” Journal of Health Care Communications 1 (4): 1–9. https://doi.org/10.4172/2472-1654.100033.
Lee, James. 1981. “Covariance Adjustment of Rates Based on the Multiple Logistic Regression Model.” Journal of Chronic Diseases 34 (8): 415–26. https://doi.org/10.1016/0021-9681(81)90006-4.
Rao, J. N. K., and Isabel Molina. 2015. Small Area Estimation. 2nd ed. Wiley Series in Survey Methodology. John Wiley & Sons.